fix: convergence issue by adding use_inductor=False in vllm compilation_config #1014

Merged
terrykong merged 11 commits into main from zhiyul/deepscaler_recipe_convergence_fix on Sep 9, 2025

Conversation

@ZhiyuLi-Nvidia (Contributor) commented Aug 28, 2025

What does this PR do?

Closes #998.

It looks like this can be resolved with the compilation flag {"use_inductor": False}.
"With this flag, vllm will use the custom CUDA kernels instead of the Triton kernels generated by torch.compile," which might be causing the numerical issue here.

There were no logprob error spikes over 140 steps, and rewards increased stably. Speed looks similar with and without the flag.
https://wandb.ai/nvidia/grpo-dev-zhiyul/workspace?nw=nwuserzhiyul

  • Rewards over 140 steps (plot attached)
  • Logprob error with and without the change (plot attached)

Issues

Closes #998 (GRPO Convergence Issue with vllm cuda graph enabled).

Usage

  • A usage sketch is included below.
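A minimal sketch of passing the flag to vLLM directly (illustrative only; the model name is an example, and this assumes a vLLM version whose LLM constructor accepts a compilation_config dict):

```python
from vllm import LLM, SamplingParams

# Disable the Triton kernels generated by torch.compile/Inductor so that vLLM
# falls back to its custom CUDA kernels.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative model
    compilation_config={"use_inductor": False},
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```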

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@parthchadha (Contributor)

@ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well? Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

@terrykong (Contributor) left a comment

nice find @ZhiyuLi-Nvidia!

is it possible to construct a model diagnostic test for this?

https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics

might be helpful for others who are debugging their model run

@ZhiyuLi-Nvidia (Contributor, Author)

Thank you @parthchadha

> @ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?

Which model do you recommend?

> Also, please attach the plots to the PR description since not everyone can access internal wandb reports.

Added the key screenshots.

…lation_config

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 6883f11 to b3aae4f Compare August 28, 2025 17:46
@parthchadha (Contributor)

> Thank you @parthchadha
>
> > @ZhiyuLi-Nvidia good find! Can you share performance on larger qwen models as well?
>
> Which model do you recommend?
>
> > Also, please attach the plots to the PR description since not everyone can access internal wandb reports.
>
> Added the key screenshots.

Let's run qwen 32b from #957 (we can try with 32k osl)

github-actions bot added the documentation label Aug 29, 2025
@ZhiyuLi-Nvidia (Contributor, Author)

> is it possible to construct a model diagnostic test for this?
>
> https://github.com/NVIDIA-NeMo/RL/tree/main/tools/model_diagnostics
>
> might be helpful for others who are debugging their model run

Good suggestion. Added.
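For reference, a rough sketch of what such a diagnostic could look like (illustrative only, not the actual script added under tools/model_diagnostics; the model name and structure are assumptions): load the same model with and without Inductor, generate greedily, and compare tokens and logprobs.

```python
from vllm import LLM, SamplingParams

# Illustrative comparison of greedy generations with and without Inductor.
# In practice, run each configuration in a separate process to avoid GPU
# memory contention between the two engines.
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # example model
PROMPTS = ["The capital of France is", "2 + 2 ="]
PARAMS = SamplingParams(temperature=0.0, max_tokens=32, logprobs=1)

def run(use_inductor: bool):
    llm = LLM(model=MODEL, compilation_config={"use_inductor": use_inductor})
    return llm.generate(PROMPTS, PARAMS)

baseline = run(use_inductor=False)
candidate = run(use_inductor=True)

for b, c in zip(baseline, candidate):
    same_tokens = list(b.outputs[0].token_ids) == list(c.outputs[0].token_ids)
    gap = abs(b.outputs[0].cumulative_logprob - c.outputs[0].cumulative_logprob)
    print(f"{b.prompt!r}: same_tokens={same_tokens}, logprob_gap={gap:.4f}")
```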

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 55191aa to f5bf231 Compare August 29, 2025 18:46
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia (Contributor, Author)

@terrykong added output example 2ba5e3e

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia (Contributor, Author)

> Let's run qwen 32b from #957 (we can try with 32k osl)

@parthchadha I kept getting OOMs in the middle of training. Shall we come back to it once this is merged or once things are in a more stable state?

parthchadha previously approved these changes Sep 2, 2025
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 2, 2025
terrykong previously approved these changes Sep 2, 2025
@terrykong terrykong added this pull request to the merge queue Sep 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 3, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/deepscaler_recipe_convergence_fix branch from 1bcb7ae to dddcbf0 Compare September 3, 2025 17:41
@terrykong terrykong changed the title from "fix: fix convergence issue by adding use_inductor=False in vllm compi…" to "fix: fix convergence issue by adding use_inductor=False in vllm compilation_config" Sep 3, 2025
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia changed the title from "fix: fix convergence issue by adding use_inductor=False in vllm compilation_config" to "fix: convergence issue by adding use_inductor=False in vllm compilation_config" Sep 3, 2025
@terrykong terrykong enabled auto-merge September 3, 2025 19:17
terrykong previously approved these changes Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 3, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 3, 2025
parthchadha previously approved these changes Sep 3, 2025
@terrykong terrykong added this pull request to the merge queue Sep 4, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 4, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
terrykong previously approved these changes Sep 4, 2025
@terrykong terrykong added this pull request to the merge queue Sep 4, 2025
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia removed this pull request from the merge queue due to a manual request Sep 5, 2025
@terrykong terrykong added this pull request to the merge queue Sep 8, 2025
Merged via the queue into main with commit 1c85276 Sep 9, 2025
21 checks passed
@terrykong terrykong deleted the zhiyul/deepscaler_recipe_convergence_fix branch September 9, 2025 00:51
guyueh1 pushed a commit to guyueh1/NeMo-RL that referenced this pull request Sep 15, 2025
…on_config (NVIDIA-NeMo#1014)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
HeyyyyyyG pushed a commit that referenced this pull request Oct 3, 2025
…on_config (#1014)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…on_config (NVIDIA-NeMo#1014)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Labels

documentation: Improvements or additions to documentation

Development

Successfully merging this pull request may close these issues.

GRPO Convergence Issue with vllm cuda graph enabled

3 participants